What is Customer Churn?
Customer churn is the percentage of customers that stopped using company's product or service during a certain time frame. Customer churn is one of the most important metrics for a growing business to evaluate as it is much less expensive to retain existing customers than it is to acquire new customers. Customers in the telecom industry can choose from a variety of service providers and actively switch from one to the next. The telecommunications business has an annual churn rate of 15-25 percent in this highly competitive market.
Customer churn is extremley costly for companies. Based on a churn rate just under two percent for top companies, one source estimates carriers lose $65 million per month from churn. To reduce customer churn, telecom companies should predict which customers are highly prone to churn.
Individualized customer retention is demanding because most companies have a large number of customers and cannot afford to devote much time to each of them. The costs would be too great, outweighing the additional revenue. However, if a corporation could forecast which customers are likely to leave ahead of time, it could concentrate customer retention efforts only on these "high risk" clients.
In this projects below questions will be answered:
Customer ID
: A unique ID that identifies each customer.Demographic info about customers:
gender
: Whether the customer is a male or a female
SeniorCitizen
: Whether the customer is a senior citizen or not (1, 0)
Partner
: Whether the customer has a partner or not (Yes, No)
Dependents
: Whether the customer has dependents or not (Yes, No)
Services that each customer has signed up for:
PhoneService
: Whether the customer has a phone service or not (Yes, No)
MultipleLines
: Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService
: Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity
: Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup
: Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection
: Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport
: Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV
: Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies
: Whether the customer has streaming movies or not (Yes, No, No internet service)
Customer account information:
tenure
: Number of months the customer has stayed with the company
Contract
: The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling
: Whether the customer has paperless billing or not (Yes, No)
PaymentMethod
: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges
: The amount charged to the customer monthly
TotalCharges
: The total amount charged to the customer
Churn
: Target, Whether the customer has left within the last month or not (Yes or No)
!pip install mlens
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: mlens in /usr/local/lib/python3.7/dist-packages (0.2.3) Requirement already satisfied: numpy>=1.11 in /usr/local/lib/python3.7/dist-packages (from mlens) (1.21.6) Requirement already satisfied: scipy>=0.17 in /usr/local/lib/python3.7/dist-packages (from mlens) (1.4.1)
# handle table-like data and matrices
import pandas as pd
import numpy as np
# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode(connected=True)
# preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
# balance data
from imblearn.over_sampling import BorderlineSMOTE
# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from mlens.ensemble import SuperLearner
from sklearn.neural_network import MLPClassifier
# evaluations
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score, plot_roc_curve, roc_curve, auc
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
# ignore warnings
import warnings
warnings.filterwarnings('ignore')
# to display the total number columns present in the dataset
pd.set_option('display.max_columns', None)
data = pd.read_csv('Telco Customer Churn.csv')
let's find if we have missing values in the dataset.
data = data.replace(r'^\s*$', np.nan, regex=True)
data.isnull().sum()
customerID 0 gender 0 SeniorCitizen 0 Partner 0 Dependents 0 tenure 0 PhoneService 0 MultipleLines 0 InternetService 0 OnlineSecurity 0 OnlineBackup 0 DeviceProtection 0 TechSupport 0 StreamingTV 0 StreamingMovies 0 Contract 0 PaperlessBilling 0 PaymentMethod 0 MonthlyCharges 0 TotalCharges 11 Churn 0 dtype: int64
msno.matrix(data);
If we examine the data carefully, we can actually estimate the value of the missing data.
Contract length in month * tenure (if not 0) * monthly charges
This is more accurate than filling missing values with mean or median.
data[data['TotalCharges'].isnull()].index.tolist()
[488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754]
ind = data[data['TotalCharges'].isnull()].index.tolist()
for i in ind:
if data['Contract'].iloc[i,] == 'Two year':
data['TotalCharges'].iloc[i,] = int(np.maximum(data['tenure'].iloc[i,], 1)) * data['MonthlyCharges'].iloc[i,] * 24
elif data['Contract'].iloc[i,] == 'One year':
data['TotalCharges'].iloc[i,] = int(np.maximum(data['tenure'].iloc[i,], 1)) * data['MonthlyCharges'].iloc[i,] * 12
else:
data['TotalCharges'].iloc[i,] = int(np.maximum(data['tenure'].iloc[i,], 1)) * data['MonthlyCharges'].iloc[i,]
data.isnull().sum()
customerID 0 gender 0 SeniorCitizen 0 Partner 0 Dependents 0 tenure 0 PhoneService 0 MultipleLines 0 InternetService 0 OnlineSecurity 0 OnlineBackup 0 DeviceProtection 0 TechSupport 0 StreamingTV 0 StreamingMovies 0 Contract 0 PaperlessBilling 0 PaymentMethod 0 MonthlyCharges 0 TotalCharges 0 Churn 0 dtype: int64
let's find if we have duplicate rows.
data.duplicated().sum()
0
data.head(3)
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
data.shape
(7043, 21)
There are 7043 cutomers and 21 features in the dataset.
for i in data.columns[6:-3]:
print(f'Number of categories in the variable {i}: {len(data[i].unique())}')
Number of categories in the variable PhoneService: 2 Number of categories in the variable MultipleLines: 3 Number of categories in the variable InternetService: 3 Number of categories in the variable OnlineSecurity: 3 Number of categories in the variable OnlineBackup: 3 Number of categories in the variable DeviceProtection: 3 Number of categories in the variable TechSupport: 3 Number of categories in the variable StreamingTV: 3 Number of categories in the variable StreamingMovies: 3 Number of categories in the variable Contract: 3 Number of categories in the variable PaperlessBilling: 2 Number of categories in the variable PaymentMethod: 4
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7043 non-null object 1 gender 7043 non-null object 2 SeniorCitizen 7043 non-null int64 3 Partner 7043 non-null object 4 Dependents 7043 non-null object 5 tenure 7043 non-null int64 6 PhoneService 7043 non-null object 7 MultipleLines 7043 non-null object 8 InternetService 7043 non-null object 9 OnlineSecurity 7043 non-null object 10 OnlineBackup 7043 non-null object 11 DeviceProtection 7043 non-null object 12 TechSupport 7043 non-null object 13 StreamingTV 7043 non-null object 14 StreamingMovies 7043 non-null object 15 Contract 7043 non-null object 16 PaperlessBilling 7043 non-null object 17 PaymentMethod 7043 non-null object 18 MonthlyCharges 7043 non-null float64 19 TotalCharges 7043 non-null object 20 Churn 7043 non-null object dtypes: float64(1), int64(2), object(18) memory usage: 1.1+ MB
data.describe()
SeniorCitizen | tenure | MonthlyCharges | |
---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 |
mean | 0.162147 | 32.371149 | 64.761692 |
std | 0.368612 | 24.559481 | 30.090047 |
min | 0.000000 | 0.000000 | 18.250000 |
25% | 0.000000 | 9.000000 | 35.500000 |
50% | 0.000000 | 29.000000 | 70.350000 |
75% | 0.000000 | 55.000000 | 89.850000 |
max | 1.000000 | 72.000000 | 118.750000 |
data.describe(include=object).T
count | unique | top | freq | |
---|---|---|---|---|
customerID | 7043 | 7043 | 7590-VHVEG | 1 |
gender | 7043 | 2 | Male | 3555 |
Partner | 7043 | 2 | No | 3641 |
Dependents | 7043 | 2 | No | 4933 |
PhoneService | 7043 | 2 | Yes | 6361 |
MultipleLines | 7043 | 3 | No | 3390 |
InternetService | 7043 | 3 | Fiber optic | 3096 |
OnlineSecurity | 7043 | 3 | No | 3498 |
OnlineBackup | 7043 | 3 | No | 3088 |
DeviceProtection | 7043 | 3 | No | 3095 |
TechSupport | 7043 | 3 | No | 3473 |
StreamingTV | 7043 | 3 | No | 2810 |
StreamingMovies | 7043 | 3 | No | 2785 |
Contract | 7043 | 3 | Month-to-month | 3875 |
PaperlessBilling | 7043 | 2 | Yes | 4171 |
PaymentMethod | 7043 | 4 | Electronic check | 2365 |
TotalCharges | 7043 | 6541 | 20.2 | 11 |
Churn | 7043 | 2 | No | 5174 |
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['gender'].unique(), values=data['gender'].value_counts(), name='Gender',
marker_colors=['gold', 'mediumturquoise']), 1, 1)
fig.add_trace(go.Pie(labels=data['Churn'].unique(), values=data['Churn'].value_counts(), name='Churn',
marker_colors=['darkorange', 'lightgreen']), 1, 2)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>Gender and Churn Distributions<b>',
# Add annotations in the center of the donut pies.
annotations=[dict(text='Gender', x=0.19, y=0.5, font_size=20, showarrow=False),
dict(text='Churn', x=0.8, y=0.5, font_size=20, showarrow=False)])
iplot(fig)
We have imbalanced data.
$26.6 \%$ of customers switched to another company.
Customers are $49.5 \%$ female and $50.5 \%$ male.
fig = px.sunburst(data, path=['Churn', 'gender'], title='<b>Sunburst Plot of Gender and churn<b>')
iplot(fig)
print(f'A female customer has a probability of {round(data[(data["gender"] == "Female") & (data["Churn"] == "Yes")].count()[0] / data[(data["gender"] == "Female")].count()[0] *100,2)} % churn')
print(f'A male customer has a probability of {round(data[(data["gender"] == "Male") & (data["Churn"] == "Yes")].count()[0] / data[(data["gender"] == "Male")].count()[0]*100,2)} % churn')
A female customer has a probability of 26.92 % churn A male customer has a probability of 26.16 % churn
fig = px.histogram(data, x='Churn', color='Contract', barmode='group', title='<b>Customer Contract Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#EC7063','#E9F00B','#0BF0D1'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer with month-to-month contract has a probability of {round(data[(data["Contract"] == "Month-to-month") & (data["Churn"] == "Yes")].count()[0] / data[(data["Contract"] == "Month-to-month")].count()[0] *100,2)} % churn')
print(f'A customer with one year contract has a probability of {round(data[(data["Contract"] == "One year") & (data["Churn"] == "Yes")].count()[0] / data[(data["Contract"] == "One year")].count()[0]*100,2)} % churn')
print(f'A customer with two year contract has a probability of {round(data[(data["Contract"] == "Two year") & (data["Churn"] == "Yes")].count()[0] / data[(data["Contract"] == "Two year")].count()[0]*100,2)} % churn')
A customer with month-to-month contract has a probability of 42.71 % churn A customer with one year contract has a probability of 11.27 % churn A customer with two year contract has a probability of 2.83 % churn
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['PaymentMethod'].unique(), values=data['PaymentMethod'].value_counts(), name='Payment Method',
marker_colors=['gold', 'mediumturquoise','darkorange', 'lightgreen']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>Payment Method Distributions<b>',
annotations=[dict(text='Payment Method', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Churn', color='PaymentMethod', barmode='group', title='<b>Payment Method Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#EC7063', '#0BF0D1', '#E9F00B', '#5DADE2'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer that use Electronic check for paying has a probability of {round(data[(data["PaymentMethod"] == "Electronic check") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Electronic check")].count()[0] *100,2)} % churn')
print(f'A customer that use Mailed check for paying has a probability of {round(data[(data["PaymentMethod"] == "Mailed check") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Mailed check")].count()[0]*100,2)} % churn')
print(f'A customer that use Bank transfer (automatic) for paying has a probability of {round(data[(data["PaymentMethod"] == "Bank transfer (automatic)") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Bank transfer (automatic)")].count()[0]*100,2)} % churn')
print(f'A customer that use Credit card (automatic) for paying has a probability of {round(data[(data["PaymentMethod"] == "Credit card (automatic)") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Credit card (automatic)")].count()[0]*100,2)} % churn')
A customer that use Electronic check for paying has a probability of 45.29 % churn A customer that use Mailed check for paying has a probability of 19.11 % churn A customer that use Bank transfer (automatic) for paying has a probability of 16.71 % churn A customer that use Credit card (automatic) for paying has a probability of 15.24 % churn
Major customers who moved out had Electronic Check as Payment Method.
Customers who chose Credit-Card automatic transfer or Bank Automatic Transfer and Mailed Check as Payment Method were less likely to move out.
data[data['gender']=='Male'][['InternetService', 'Churn']].value_counts()
InternetService Churn DSL No 993 Fiber optic No 910 No No 722 Fiber optic Yes 633 DSL Yes 240 No Yes 57 dtype: int64
data[data['gender']=='Female'][['InternetService', 'Churn']].value_counts()
InternetService Churn DSL No 969 Fiber optic No 889 No No 691 Fiber optic Yes 664 DSL Yes 219 No Yes 56 dtype: int64
fig = go.Figure()
fig.add_trace(go.Bar(
x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
['Female', 'Male', 'Female', 'Male']],
y = [965, 992, 219, 240],
name = 'DSL',
))
fig.add_trace(go.Bar(
x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
['Female', 'Male', 'Female', 'Male']],
y = [889, 910, 664, 633],
name = 'Fiber optic',
))
fig.add_trace(go.Bar(
x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
['Female', 'Male', 'Female', 'Male']],
y = [690, 717, 56, 57],
name = 'No Internet',
))
fig.update_layout(title_text='<b>Churn Distribution w.r.t. Internet Service and Gender</b>')
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
A lot of customers choose the Fiber optic service and it's also evident that the customers who use Fiber optic have high churn rate, this might suggest a dissatisfaction with this type of internet service.
Customers having DSL service are majority in number and have less churn rate compared to Fibre optic service.
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['Dependents'].unique(), values=data['Dependents'].value_counts(), name='Dependents',
marker_colors=['#E5527A ', '#AAB7B8']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>Dependents Distribution<b>',
annotations=[dict(text='Dependents', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Dependents', color='Churn', barmode='group', title='<b>Dependents Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#00CC96','#FFA15A'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer with dependents has a probability of {round(data[(data["Dependents"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["Dependents"] == "Yes")].count()[0] *100,2)} % churn')
print(f'A customer without dependents has a probability of {round(data[(data["Dependents"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["Dependents"] == "No")].count()[0]*100,2)} % churn')
A customer with dependents has a probability of 15.45 % churn A customer without dependents has a probability of 31.28 % churn
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['Partner'].unique(), values=data['Partner'].value_counts(), name='Partner',
marker_colors=['gold', 'purple']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>Partner Distribution<b>',
annotations=[dict(text='Partner', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Churn', color='Partner', barmode='group', title='<b>Partner Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#C82735','#BCC827'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer with a partner has a probability of {round(data[(data["Partner"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["Partner"] == "Yes")].count()[0] *100,2)} % churn')
print(f'A customer without a partner has a probability of {round(data[(data["Partner"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["Partner"] == "No")].count()[0]*100,2)} % churn')
A customer with a partner has a probability of 19.66 % churn A customer without a partner has a probability of 32.96 % churn
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=['No', 'Yes'], values=data['SeniorCitizen'].value_counts(), name='Senior Citizen',
marker_colors=['#56E11A', '#1A87E1']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>Senior Citizen Distribution<b>',
annotations=[dict(text='Senior Citizen', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Churn', color='SeniorCitizen', barmode='group', title='<b>Senior Citizen Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#E11AC6','#BAE11A'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer that is a senior citizen has a probability of {round(data[(data["SeniorCitizen"] == 1) & (data["Churn"] == "Yes")].count()[0] / data[(data["SeniorCitizen"] == 1)].count()[0] *100,2)} % churn')
print(f'A customer that is not a senior citizen has a probability of {round(data[(data["SeniorCitizen"] == 0) & (data["Churn"] == "Yes")].count()[0] / data[(data["SeniorCitizen"] == 0)].count()[0]*100,2)} % churn')
A customer that is a senior citizen has a probability of 41.68 % churn A customer that is not a senior citizen has a probability of 23.61 % churn
It can be observed that the fraction of senior citizen is very less.
About $42\%$ of the senior citizens churn.
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['OnlineSecurity'].unique(), values=data['OnlineSecurity'].value_counts(), name='OnlineSecurity',
marker_colors=['#1AE178', '#2CECE6', 'red']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>Online Security Distribution<b>',
annotations=[dict(text='Online Security', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Churn', color='OnlineSecurity', barmode='group', title='<b>Online Security Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#EB984E','yellow', '#5499C7'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer with an online security has a probability of {round(data[(data["OnlineSecurity"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["OnlineSecurity"] == "Yes")].count()[0] *100,2)} % churn')
print(f'A customer without an online security has a probability of {round(data[(data["OnlineSecurity"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["OnlineSecurity"] == "No")].count()[0]*100,2)} % churn')
print(f'A customer with no internet service has a probability of {round(data[(data["OnlineSecurity"] == "No internet service") & (data["Churn"] == "Yes")].count()[0] / data[(data["OnlineSecurity"] == "No internet service")].count()[0]*100,2)} % churn')
A customer with an online security has a probability of 14.61 % churn A customer without an online security has a probability of 41.77 % churn A customer with no internet service has a probability of 7.4 % churn
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['PaperlessBilling'].unique(), values=data['PaperlessBilling'].value_counts(), name='PaperlessBilling',
marker_colors=['LightCoral', '#CCCCFF']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>PaperlessBilling Distribution<b>',
annotations=[dict(text='PaperlessBilling Security', x=0.5, y=0.5, font_size=14, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Churn', color='PaperlessBilling', barmode='group', title='<b>Paperless Billing Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#9FE2BF', '#FF7F50'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer with PaperlessBilling has a probability of {round(data[(data["PaperlessBilling"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaperlessBilling"] == "Yes")].count()[0] *100,2)} % churn')
print(f'A customer without PaperlessBilling has a probability of {round(data[(data["PaperlessBilling"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaperlessBilling"] == "No")].count()[0]*100,2)} % churn')
A customer with PaperlessBilling has a probability of 33.57 % churn A customer without PaperlessBilling has a probability of 16.33 % churn
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['TechSupport'].unique(), values=data['TechSupport'].value_counts(), name='TechSupport',
marker_colors=['#DE3163', '#DFFF00', '#40E0D0']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>TechSupport Distribution<b>',
annotations=[dict(text='Tech Support', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Churn', color='TechSupport', barmode='group', title='<b>Tech Support Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#FFBF00', 'IndianRed', 'red'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer with a tech support has a probability of {round(data[(data["TechSupport"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["TechSupport"] == "Yes")].count()[0] *100,2)} % churn')
print(f'A customer without a tech support has a probability of {round(data[(data["TechSupport"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["TechSupport"] == "No")].count()[0]*100,2)} % churn')
print(f'A customer with no internet service has a probability of {round(data[(data["TechSupport"] == "No internet service") & (data["Churn"] == "Yes")].count()[0] / data[(data["TechSupport"] == "No internet service")].count()[0]*100,2)} % churn')
A customer with a tech support has a probability of 15.17 % churn A customer without a tech support has a probability of 41.64 % churn A customer with no internet service has a probability of 7.4 % churn
fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['PhoneService'].unique(), values=data['PhoneService'].value_counts(), name='PhoneService',
marker_colors=['LightSalmon', '#7FB3D5']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
title_text='<b>Phone Service Distribution<b>',
annotations=[dict(text='Phone Service', x=0.5, y=0.5, font_size=20, showarrow=False)])
iplot(fig)
fig = px.histogram(data, x='Churn', color='PhoneService', barmode='group', title='<b>Phone Service Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#FFBF00', 'IndianRed'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2,marker_line_color='black')
iplot(fig)
print(f'A customer with phone service has a probability of {round(data[(data["PhoneService"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["PhoneService"] == "Yes")].count()[0] *100,2)} % churn')
print(f'A customer without phone service has a probability of {round(data[(data["PhoneService"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["PhoneService"] == "No")].count()[0]*100,2)} % churn')
A customer with phone service has a probability of 26.71 % churn A customer without phone service has a probability of 24.93 % churn
fig = px.histogram(data, x='MonthlyCharges', color='Churn', marginal='box', title='<b>Monthly Charges Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['#84D57F', '#C959DA'])
iplot(fig)
fig = px.histogram(data, x='TotalCharges', color='Churn', marginal='box', title='<b>Total Charges Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['blue', 'red'])
iplot(fig)
fig = px.histogram(data, x='tenure', color='Churn', marginal='box', title='<b>Tenure Distribution w.r.t. Churn<b>',
color_discrete_sequence = ['orange', 'green'])
iplot(fig)
The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance, therefore we should see there are ouliers in the data.
data=data.drop(labels=['customerID'],axis=1)
sns.distplot(data.TotalCharges);
sns.distplot(data.MonthlyCharges);
sns.distplot(data.tenure);
Another way of visualising outliers is using boxplots and whiskers, which provides the quantiles (box) and inter-quantile range (whiskers), with the outliers sitting outside the error bars (whiskers).
All the dots in the plot below are outliers according to the quantiles + 1.5 IQR rule
first let's specify the datatype of TotalCharges
as numerical.
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
fig = make_subplots(rows=1, cols=3)
fig.add_trace(go.Box(y=data['MonthlyCharges'], notched=True, name='Monthly Charges', marker_color = '#6699ff',
boxmean=True, boxpoints='suspectedoutliers'), 1, 2)
fig.add_trace(go.Box(y=data['TotalCharges'], notched=True, name='Total Charges', marker_color = '#ff0066',
boxmean=True, boxpoints='suspectedoutliers'), 1, 1)
fig.add_trace(go.Box(y=data['tenure'], notched=True, name='Tenure', marker_color = 'lightseagreen',
boxmean=True, boxpoints='suspectedoutliers'), 1, 3)
fig.update_layout(title_text='<b>Box Plots for Numerical Variables<b>')
iplot(fig)
def detect_outliers(d):
for i in d:
Q3, Q1 = np.percentile(data[i], [75 ,25])
IQR = Q3 - Q1
ul = Q3+1.5*IQR
ll = Q1-1.5*IQR
outliers = data[i][(data[i] > ul) | (data[i] < ll)]
print(f'*** {i} outlier points***', '\n', outliers, '\n')
detect_outliers(['tenure', 'MonthlyCharges', 'TotalCharges'])
*** tenure outlier points*** Series([], Name: tenure, dtype: int64) *** MonthlyCharges outlier points*** Series([], Name: MonthlyCharges, dtype: float64) *** TotalCharges outlier points*** Series([], Name: TotalCharges, dtype: float64)
There is no outlier.
Some categories may appear a lot in the dataset, whereas some other categories appear only in a few number of observations.
categorical = [var for var in data.columns if data[var].dtype=='O']
# check the number of different labels
for var in categorical:
print(data[var].value_counts() / np.float(len(data)))
print()
print()
Male 0.504756 Female 0.495244 Name: gender, dtype: float64 No 0.516967 Yes 0.483033 Name: Partner, dtype: float64 No 0.700412 Yes 0.299588 Name: Dependents, dtype: float64 Yes 0.903166 No 0.096834 Name: PhoneService, dtype: float64 No 0.481329 Yes 0.421837 No phone service 0.096834 Name: MultipleLines, dtype: float64 Fiber optic 0.439585 DSL 0.343746 No 0.216669 Name: InternetService, dtype: float64 No 0.496663 Yes 0.286668 No internet service 0.216669 Name: OnlineSecurity, dtype: float64 No 0.438450 Yes 0.344881 No internet service 0.216669 Name: OnlineBackup, dtype: float64 No 0.439443 Yes 0.343888 No internet service 0.216669 Name: DeviceProtection, dtype: float64 No 0.493114 Yes 0.290217 No internet service 0.216669 Name: TechSupport, dtype: float64 No 0.398978 Yes 0.384353 No internet service 0.216669 Name: StreamingTV, dtype: float64 No 0.395428 Yes 0.387903 No internet service 0.216669 Name: StreamingMovies, dtype: float64 Month-to-month 0.550192 Two year 0.240664 One year 0.209144 Name: Contract, dtype: float64 Yes 0.592219 No 0.407781 Name: PaperlessBilling, dtype: float64 Electronic check 0.335794 Mailed check 0.228880 Bank transfer (automatic) 0.219225 Credit card (automatic) 0.216101 Name: PaymentMethod, dtype: float64 No 0.73463 Yes 0.26537 Name: Churn, dtype: float64
As shown above, there is no rare category in the categorical variables.
data['Churn'] = data['Churn'].map({'Yes':1,'No':0})
data.dtypes
gender object SeniorCitizen int64 Partner object Dependents object tenure int64 PhoneService object MultipleLines object InternetService object OnlineSecurity object OnlineBackup object DeviceProtection object TechSupport object StreamingTV object StreamingMovies object Contract object PaperlessBilling object PaymentMethod object MonthlyCharges float64 TotalCharges float64 Churn int64 dtype: object
This step is the key to achieve a high accuracy. We use Target guided ordinal encoding
. Ordering the categories according to the target means assigning a number to the category, but this numbering, this ordering, is informed by the mean of the target within the category. Briefly, we calculate the mean of the target for each label/category, then we order the labels according to these mean from smallest to biggest, and we number them accordingly.
Advantages:
Disadvantage:
This process should be done on the train data, then the ordered label will be mapped into test data. (since the data is large enough, ordered categories will be same if we consider the whole data or just the train set.)
categorical = [var for var in data.columns if data[var].dtype=='O']
def category(df):
for var in categorical:
ordered_labels = df.groupby([var])['Churn'].mean().sort_values().index
ordinal_label = {k:i for i, k in enumerate(ordered_labels, 0)}
ordinal_label
df[var] = df[var].map(ordinal_label)
category(data)
data.head(5)
gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 3 | 29.85 | 29.85 | 0 |
1 | 0 | 0 | 1 | 1 | 34 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 1 | 0 | 2 | 56.95 | 1889.50 | 0 |
2 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 53.85 | 108.15 | 1 |
3 | 0 | 0 | 1 | 1 | 45 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 1 | 0 | 1 | 42.30 | 1840.75 | 0 |
4 | 1 | 0 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 3 | 70.70 | 151.65 | 1 |
fig = px.bar(x=data['Churn'].unique()[::-1], y=[data[data['Churn']==1].count()[0], data[data['Churn']==0].count()[0]],
text=[np.round(data[data['Churn']==1].count()[0]/data.shape[0], 4), np.round(data[data['Churn']==0].count()[0]/data.shape[0], 4)]
, color_discrete_sequence =['#ff9999'])
fig.update_layout(title_text='<b>Churn Count PLot<b>', xaxis = dict(tickmode = 'linear', tick0 = 0, dtick = 1),
width=700, height=400, bargap=0.4)
fig.update_layout({'yaxis': {'title':'Count'}, 'xaxis': {'title':'Churn'}})
iplot(fig)
As shown in the plot above, we are dealing with an imbalanced dataset. The BorderlineSMOTE
method is used which involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model. This method oversamples just those difficult instances, providing more resolution only where it may be required.
X = data.drop(['Churn'], axis = 1)
y = data['Churn']
oversample = BorderlineSMOTE()
X, y = oversample.fit_resample(X, y)
let's separate the data into training and testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train.shape, X_test.shape
((9313, 19), (1035, 19))
In this section, numerical features are scaled.
StandardScaler = $\frac{x-\mu}{s}$
scaler = StandardScaler()
X_train[['TotalCharges','MonthlyCharges','tenure']] = scaler.fit_transform(X_train[['TotalCharges','MonthlyCharges','tenure']])
X_test[['TotalCharges','MonthlyCharges','tenure']] = scaler.transform(X_test[['TotalCharges','MonthlyCharges','tenure']])
CV = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
Model 1 : LR
LR_S = LogisticRegression(random_state = 42)
params_LR = {'C': list(np.arange(1,12)), 'penalty': ['l2', 'elasticnet', 'none'], 'class_weight': ['balanced','None']}
grid_LR = RandomizedSearchCV(LR_S, param_distributions=params_LR, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_LR.fit(X_train, y_train)
print('Best parameters:', grid_LR.best_estimator_)
Best parameters: LogisticRegression(C=1, class_weight='None', random_state=42)
LR = LogisticRegression(random_state = 42, penalty= 'l2', class_weight= 'balanced', C=6)
cross_val_LR_Acc = cross_val_score(LR, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_LR_f1 = cross_val_score(LR, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_LR_AUC = cross_val_score(LR, X_train, y_train, cv = CV, scoring = 'roc_auc')
Model 2: Random Forest
RF_S = RandomForestClassifier(random_state = 42)
params_RF = {'n_estimators': list(range(50,100)), 'min_samples_leaf': list(range(1,5)), 'min_samples_split': list(range(1,5))}
grid_RF = RandomizedSearchCV(RF_S, param_distributions=params_RF, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_RF.fit(X_train, y_train)
print('Best parameters:', grid_RF.best_estimator_)
Best parameters: RandomForestClassifier(n_estimators=65, random_state=42)
RF = RandomForestClassifier(n_estimators=70, random_state=42)
cross_val_RF_Acc = cross_val_score(RF, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_RF_f1 = cross_val_score(RF, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_RF_AUC = cross_val_score(RF, X_train, y_train, cv = CV, scoring = 'roc_auc')
Model 3: KNN
KNN_S = KNeighborsClassifier()
params_KNN = {'n_neighbors': list(range(1,20))}
grid_KNN = RandomizedSearchCV(KNN_S, param_distributions=params_KNN, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_KNN.fit(X_train, y_train)
print('Best parameters:', grid_KNN.best_estimator_)
Best parameters: KNeighborsClassifier(n_neighbors=1)
KNN = KNeighborsClassifier(n_neighbors=1)
cross_val_KNN_Acc = cross_val_score(KNN, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_KNN_f1 = cross_val_score(KNN, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_KNN_AUC = cross_val_score(KNN, X_train, y_train, cv = CV, scoring = 'roc_auc')
Model 4: Decision Tree
DT_S = DecisionTreeClassifier(random_state=42)
params_DT = {'min_samples_leaf': list(range(1,6)), 'min_samples_split': list(range(1,6))}
grid_DT = RandomizedSearchCV(DT_S, param_distributions=params_DT, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_DT.fit(X_train, y_train)
print('Best parameters:', grid_DT.best_estimator_)
Best parameters: DecisionTreeClassifier(random_state=42)
DT = DecisionTreeClassifier(random_state=42)
cross_val_DT_Acc = cross_val_score(DT, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_DT_f1 = cross_val_score(DT, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_DT_AUC = cross_val_score(DT, X_train, y_train, cv = CV, scoring = 'roc_auc')
Model 5: Ada Boost
AB_S = AdaBoostClassifier(random_state=42)
params_AB = {'n_estimators': list(np.arange(50,100,10)), 'learning_rate':[0.01, 0.1, 1]}
grid_AB = RandomizedSearchCV(AB_S, param_distributions=params_AB, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_AB.fit(X_train, y_train)
print('Best parameters:', grid_AB.best_estimator_)
Best parameters: AdaBoostClassifier(learning_rate=1, n_estimators=90, random_state=42)
AB = AdaBoostClassifier(learning_rate=1, n_estimators=90, random_state=42)
cross_val_AB_Acc = cross_val_score(AB, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_AB_f1 = cross_val_score(AB, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_AB_AUC = cross_val_score(AB, X_train, y_train, cv = CV, scoring = 'roc_auc')
Model 6: XG Boost
XG_S = XGBClassifier(random_state=42)
params_XG = {'n_estimators': list(np.arange(50,150,10)), 'learning_rate':[0.01, 0.1, 1]}
grid_XG = RandomizedSearchCV(XG_S, param_distributions=params_XG, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_XG.fit(X_train, y_train)
print('Best parameters:', grid_XG.best_estimator_)
Best parameters: XGBClassifier(learning_rate=1, n_estimators=130, random_state=42)
XG = XGBClassifier(learning_rate=1, n_estimators=120, random_state=42)
cross_val_XG_Acc = cross_val_score(XG, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_XG_f1 = cross_val_score(XG, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_XG_AUC = cross_val_score(XG, X_train, y_train, cv = CV, scoring = 'roc_auc')
Model 7: Extra Tree Classifier
ET_S = ExtraTreesClassifier(random_state=42)
params_ET = {'n_estimators': list(np.arange(50,150,10))}
grid_ET = RandomizedSearchCV(XG_S, param_distributions=params_ET, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_ET.fit(X_train, y_train)
print('Best parameters:', grid_ET.best_estimator_)
Best parameters: XGBClassifier(n_estimators=140, random_state=42)
ET = ExtraTreesClassifier(n_estimators=140, random_state=42)
cross_val_ET_Acc = cross_val_score(ET, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_ET_f1 = cross_val_score(ET, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_ET_AUC = cross_val_score(ET, X_train, y_train, cv = CV, scoring = 'roc_auc')
Super Learner
SL = SuperLearner(folds=5, random_state=42)
SL.add([RF, XG, ET])
SuperLearner(array_check=None, backend=None, folds=5, layers=[Layer(backend='threading', dtype=<class 'numpy.float32'>, n_jobs=-1, name='layer-1', propagate_features=None, raise_on_exception=True, random_state=7270, shuffle=False, stack=[Group(backend='threading', dtype=<class 'numpy.float32'>, indexer=FoldIndex(X=None, folds=5, raise_on_ex...rer=None)], n_jobs=-1, name='group-0', raise_on_exception=True, transformers=[])], verbose=0)], model_selection=False, n_jobs=None, raise_on_exception=True, random_state=42, sample_size=20, scorer=None, shuffle=False, verbose=False)
SL.add_meta(MLPClassifier())
SuperLearner(array_check=None, backend=None, folds=5, layers=[Layer(backend='threading', dtype=<class 'numpy.float32'>, n_jobs=-1, name='layer-1', propagate_features=None, raise_on_exception=True, random_state=7270, shuffle=False, stack=[Group(backend='threading', dtype=<class 'numpy.float32'>, indexer=FoldIndex(X=None, folds=5, raise_on_ex...rer=None)], n_jobs=-1, name='group-1', raise_on_exception=True, transformers=[])], verbose=0)], model_selection=False, n_jobs=None, raise_on_exception=True, random_state=42, sample_size=20, scorer=None, shuffle=False, verbose=False)
cross_val_SL_Acc = cross_val_score(SL, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_SL_f1 = cross_val_score(SL, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_SL_AUC = cross_val_score(SL, X_train, y_train, cv = CV, scoring = 'roc_auc')
Stacking
estimators = [('DT', DT),
('RF', RF),
('ET', ET),
('LR', LR),
('KNN', KNN),
('XG', XG),
('AB', AB)]
Stack = StackingClassifier(estimators = estimators, final_estimator = MLPClassifier())
cross_val_ST_Acc = cross_val_score(Stack, X_train, y_train, cv = CV, scoring = 'accuracy')
cross_val_ST_f1 = cross_val_score(Stack, X_train, y_train, cv = CV, scoring = 'f1')
cross_val_ST_AUC = cross_val_score(Stack, X_train, y_train, cv = CV, scoring = 'roc_auc')
What features contribute more to predict the target (Churn)? let's find out how useful they are at predicting the target variable.
Random Forest algorithm offer importance scores based on the reduction in the criterion used to select split points, like Gini or entropy.
RF_I = RandomForestClassifier(n_estimators=70, random_state=42)
RF_I.fit(X, y)
RandomForestClassifier(n_estimators=70, random_state=42)
d = {'Features': X_train.columns, 'Feature Importance': RF_I.feature_importances_}
df = pd.DataFrame(d)
df_sorted = df.sort_values(by='Feature Importance', ascending = True)
df_sorted
df_sorted.style.background_gradient(cmap='Blues')
Features | Feature Importance | |
---|---|---|
5 | PhoneService | 0.008060 |
1 | SeniorCitizen | 0.016236 |
3 | Dependents | 0.018789 |
2 | Partner | 0.022521 |
15 | PaperlessBilling | 0.023343 |
10 | DeviceProtection | 0.024419 |
6 | MultipleLines | 0.024435 |
9 | OnlineBackup | 0.025233 |
12 | StreamingTV | 0.027024 |
0 | gender | 0.027551 |
8 | OnlineSecurity | 0.028928 |
11 | TechSupport | 0.029647 |
13 | StreamingMovies | 0.033099 |
7 | InternetService | 0.034272 |
16 | PaymentMethod | 0.045543 |
14 | Contract | 0.098061 |
17 | MonthlyCharges | 0.159143 |
4 | tenure | 0.164164 |
18 | TotalCharges | 0.189532 |
fig = px.bar(x=df_sorted['Feature Importance'], y=df_sorted['Features'], color_continuous_scale=px.colors.sequential.Blues,
title='<b>Feature Importance Based on Random Forest<b>', text_auto='.4f', color=df_sorted['Feature Importance'])
fig.update_traces(marker=dict(line=dict(color='black', width=2)))
fig.update_layout({'yaxis': {'title':'Features'}, 'xaxis': {'title':'Feature Importance'}})
iplot(fig)
compare_models = [('Logistic Regression', cross_val_LR_Acc.mean(),cross_val_LR_f1.mean(),cross_val_LR_AUC.mean(), ''),
('Random Forest', cross_val_RF_Acc.mean(),cross_val_RF_f1.mean(),cross_val_RF_AUC.mean(), ''),
('KNN', cross_val_KNN_Acc.mean(),cross_val_KNN_f1.mean(),cross_val_KNN_AUC.mean(), ''),
('Decision Tree', cross_val_DT_Acc.mean(), cross_val_DT_f1.mean(),cross_val_DT_AUC.mean(), ''),
('Ada Boost', cross_val_AB_Acc.mean(), cross_val_AB_f1.mean(),cross_val_AB_AUC.mean(), ''),
('XG Boost', cross_val_XG_Acc.mean(), cross_val_XG_f1.mean(),cross_val_XG_AUC.mean(), ''),
('Extra Tree', cross_val_ET_Acc.mean(), cross_val_ET_f1.mean(),cross_val_ET_AUC.mean(), ''),
('Super Learner', cross_val_SL_Acc.mean(), cross_val_SL_f1.mean(),cross_val_SL_AUC.mean(), ''),
('Stacking', cross_val_ST_Acc.mean(), cross_val_ST_f1.mean(),cross_val_ST_AUC.mean(), 'best model')]
compare = pd.DataFrame(data = compare_models, columns=['Model','Accuracy Mean', 'F1 Score Mean', 'AUC Score Mean', 'Description'])
compare.style.background_gradient(cmap='YlGn')
Model | Accuracy Mean | F1 Score Mean | AUC Score Mean | Description | |
---|---|---|---|---|---|
0 | Logistic Regression | 0.746267 | 0.759828 | 0.825137 | |
1 | Random Forest | 0.834962 | 0.841735 | 0.915320 | |
2 | KNN | 0.772686 | 0.785045 | 0.772846 | |
3 | Decision Tree | 0.782881 | 0.786659 | 0.783714 | |
4 | Ada Boost | 0.769136 | 0.781800 | 0.846343 | |
5 | XG Boost | 0.797485 | 0.803550 | 0.878836 | |
6 | Extra Tree | 0.827876 | 0.832871 | 0.908827 | |
7 | Super Learner | 0.833995 | 0.840168 | nan | |
8 | Stacking | 0.841942 | 0.844890 | 0.921315 | best model |
d1 = {'Logistic Regression':cross_val_LR_Acc, 'Random Forest':cross_val_RF_Acc, 'KNN':cross_val_KNN_Acc, 'Decision Tree':cross_val_DT_Acc,
'Ada Boost':cross_val_AB_Acc, 'XG Boost':cross_val_XG_Acc, 'Extra Tree':cross_val_ET_Acc, 'Super Learner':cross_val_SL_Acc,
'Stacking':cross_val_ST_Acc}
d_accuracy = pd.DataFrame(data = d1)
d2 = {'Logistic Regression':cross_val_LR_f1, 'Random Forest':cross_val_RF_f1, 'KNN':cross_val_KNN_f1, 'Decision Tree':cross_val_DT_f1,
'Ada Boost':cross_val_AB_f1, 'XG Boost':cross_val_XG_f1, 'Extra Tree':cross_val_ET_f1, 'Super Learner':cross_val_SL_f1,
'Stacking':cross_val_ST_f1}
d_f1 = pd.DataFrame(data = d2)
d3 = {'Logistic Regression':cross_val_LR_AUC, 'Random Forest':cross_val_RF_AUC, 'KNN':cross_val_KNN_AUC, 'Decision Tree':cross_val_DT_AUC,
'Ada Boost':cross_val_AB_AUC, 'XG Boost':cross_val_XG_AUC, 'Extra Tree':cross_val_ET_AUC, 'Super Learner':cross_val_SL_AUC,
'Stacking':cross_val_ST_AUC}
d_auc = pd.DataFrame(data = d3)
fig = go.Figure()
fig.add_trace(go.Box(name='Logistic Regression', y=d_accuracy.iloc[:,0]))
fig.add_trace(go.Box(name='Random Forest', y=d_accuracy.iloc[:,1]))
fig.add_trace(go.Box(name='KNN', y=d_accuracy.iloc[:,2]))
fig.add_trace(go.Box(name='Decision Tree', y=d_accuracy.iloc[:,3]))
fig.add_trace(go.Box(name='Ada Boost', y=d_accuracy.iloc[:,4]))
fig.add_trace(go.Box(name='XG Boost', y=d_accuracy.iloc[:,5]))
fig.add_trace(go.Box(name='Extra Tree', y=d_accuracy.iloc[:,6]))
fig.add_trace(go.Box(name='Super Learner', y=d_accuracy.iloc[:,7]))
fig.add_trace(go.Box(name='Stacking', y=d_accuracy.iloc[:,8]))
fig.update_traces(boxpoints='all', boxmean=True)
fig.update_layout(title_text='<b>Box Plots for Models Accuracy (train)<b>')
iplot(fig)
fig = go.Figure()
fig.add_trace(go.Box(name='Logistic Regression', y=d_f1.iloc[:,0]))
fig.add_trace(go.Box(name='Random Forest', y=d_f1.iloc[:,1]))
fig.add_trace(go.Box(name='KNN', y=d_f1.iloc[:,2]))
fig.add_trace(go.Box(name='Decision Tree', y=d_f1.iloc[:,3]))
fig.add_trace(go.Box(name='Ada Boost', y=d_f1.iloc[:,4]))
fig.add_trace(go.Box(name='XG Boost', y=d_f1.iloc[:,5]))
fig.add_trace(go.Box(name='Extra Tree', y=d_f1.iloc[:,6]))
fig.add_trace(go.Box(name='Super Learner', y=d_f1.iloc[:,7]))
fig.add_trace(go.Box(name='Stacking', y=d_f1.iloc[:,8]))
fig.update_traces(boxpoints='all', boxmean=True)
fig.update_layout(title_text='<b>Box Plots for Models F1 Score (train)<b>')
iplot(fig)
fig = go.Figure()
fig.add_trace(go.Box(name='Logistic Regression', y=d_auc.iloc[:,0]))
fig.add_trace(go.Box(name='Random Forest', y=d_auc.iloc[:,1]))
fig.add_trace(go.Box(name='KNN', y=d_auc.iloc[:,2]))
fig.add_trace(go.Box(name='Decision Tree', y=d_auc.iloc[:,3]))
fig.add_trace(go.Box(name='Ada Boost', y=d_auc.iloc[:,4]))
fig.add_trace(go.Box(name='XG Boost', y=d_auc.iloc[:,5]))
fig.add_trace(go.Box(name='Extra Tree', y=d_auc.iloc[:,6]))
fig.add_trace(go.Box(name='Stacking', y=d_auc.iloc[:,8]))
fig.update_traces(boxpoints='all', boxmean=True)
fig.update_layout(title_text='<b>Box Plots for Models AUC (train)<b>')
iplot(fig)
Stacking model is the most stable and accurate model. As a result, Stacking is selected for the purpose of predicting Churn.
Stack.fit(X_train, y_train)
y_pred = Stack.predict(X_test)
print(classification_report(y_test,y_pred))
precision recall f1-score support 0 0.86 0.83 0.85 505 1 0.85 0.87 0.86 530 accuracy 0.85 1035 macro avg 0.85 0.85 0.85 1035 weighted avg 0.85 0.85 0.85 1035
y_prob = Stack.predict_proba(X_test)
roc_auc_score(y_test, y_prob[:,1],average='macro')
0.9302073603586775
fpr, tpr, thresholds = roc_curve(y_test, y_prob[:,1])
fig = px.area(
x=fpr, y=tpr,
title=f'<b>ROC Curve (AUC={auc(fpr, tpr):.4f})<b>',
labels=dict(x='False Positive Rate', y='True Positive Rate'),
width=700, height=500, color_discrete_sequence=['#DA598A'])
fig.add_shape(
type='line', line=dict(dash='dash'),
x0=0, x1=1, y0=0, y1=1
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
iplot(fig)
cm = confusion_matrix(y_test, y_pred)
cm = cm.astype(int)
fig = ff.create_annotated_heatmap(z=cm[::-1], x=['No','Yes'], y=['Yes', 'No'], colorscale='Blues', annotation_text=cm[::-1])
fig.update_layout(title_text='<b>Confusion Matrix of Stacking Model<b>',
xaxis_title = 'Predicted value', yaxis_title = 'Real value', width=800, height=500)
iplot(fig)
We achieved about $86\%$ accuracy on the test.
Customer churn is definitely bad to a firm ’s profitability. Various strategies can be implemented to eliminate customer churn. The best way to avoid customer churn is for a company to truly know its customers. This includes identifying customers who are at risk of churning and working to improve their satisfaction. Improving customer service is, of course, at the top of the priority for tackling this issue. Building customer loyalty through relevant experiences and specialized service is another strategy to reduce customer churn. Some firms survey customers who have already churned to understand their reasons for leaving in order to adopt a proactive approach to avoiding future customer churn.